library(TSstudio)
data("USVSales")
ts_info(USVSales)
## The USVSales series is a ts object with 1 variable and 528 observations
## Frequency: 12
## Start time: 1976 1
## End time: 2019 12
The USVSales series is a monthly ts object that represents the total vehicle sales in the US between 1976 and 2019, in thousands of units. Let’s plot the series and review its structure with the ts_plot function:
ts_plot(USVSales,
title = "US Total Monthly Vehicle Sales",
Ytitle = "Thousands of Units",
Xtitle = "Year")
As we can see in the preceding plot, the series has a cyclical pattern, which is common for a macroeconomic indicator; in this case, total vehicle sales serve as an indicator of the US economy.
We can get a deeper view of the series components by decomposing the series into its components and plotting them with the ts_decompose function:
ts_decompose(USVSales)
Besides the trend-cycle component, we can observe that the series has a strong seasonal pattern, which we will explore next.
To get a closer look at the seasonal component of the series, we will subtract the trend component (estimated with the decompose function, as discussed previously) from the series, and then use the ts_seasonal function to plot a box plot of the detrended series by month:
USVSales_detrend <- USVSales - decompose(USVSales)$trend
ts_seasonal(USVSales_detrend, type = "box")
## Warning: Ignoring 1 observations
We can see from the preceding seasonal plot that, typically, the peak of the year occurs during the months of March, May, and June. In addition, sales decline over the summer months and peak again in December during the holiday season. On the other hand, January is typically the slowest month of the year in terms of sales.
The USVSales series appears to have a high correlation with its first seasonal lag. We can verify this with the ts_cor function from the TSstudio package, which plots the autocorrelation of the series:
ts_cor(USVSales)
We can zoom in on the relationship of the series with the last three seasonal lags using the ts_lags function:
ts_lags(USVSales, lags = c(12, 24, 36))
As shown in the preceding lags plot, the series has a strong linear relationship with its first and second seasonal lags.
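To illustrate what the lag plots quantify, here is a small base-R sketch of the correlation between a series and its seasonal lag. Note that it uses a simulated monthly series, not USVSales, so the exact values are illustrative only:

```r
# Toy example (simulated data): the correlation between a monthly series and
# its 12th lag captures the strength of the seasonal pattern
set.seed(1234)
t <- 1:120
y <- 100 + 10 * sin(2 * pi * t / 12) + rnorm(120, sd = 2)  # seasonal signal + noise
lag12_cor <- cor(y[13:120], y[1:108])  # pair each value with the one 12 months back
```

A value close to 1 indicates a strong seasonal relationship, which is exactly what the first seasonal lag of USVSales exhibits.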
We can conclude our short exploratory analysis of the USVSales series with the following observations:

+ The USVSales series is a monthly series with a clear monthly seasonality
+ The series trend has a cyclical shape, so the series has a cycle component embedded in the trend
+ The series’ most recent cycle started right after the end of the 2008 economic crisis, between 2009 and 2010
+ It seems that the current cycle has reached its peak, as the trend is starting to flatten out
+ The series has a strong correlation with its first seasonal lag
Moreover, as we intend to produce a short-term forecast (of 12 months), there is no point in using the full series, as it may introduce noise into the model due to the change of trend direction every few years. (If you were creating a long-term forecast, it might be a good idea to use all or most of the series.) Therefore, we will train the model on observations from 2005 onward. We will use the ts_to_prophet function from the TSstudio package to transform the series from a ts object into a data.frame, and the window function to subset the observations since January 2005:
df <- ts_to_prophet(window(USVSales, start = c(2005,1)))
names(df) <- c("date", "y")
head(df)
Before we move forward and start with the feature engineering stage, let’s plot and review the subset series of USVSales with the ts_plot function:
ts_plot(df,
title = "US Total Monthly Vehicle Sales (Subset)",
Ytitle = "Thousands of Units",
Xtitle = "Year")
Feature engineering plays a pivotal role when modeling with ML algorithms. Our next step, based on the preceding observations, is to create new features that can be used as informative input for the model. In the context of time series forecasting, here are some examples of possible new features that can be created from the series itself:

+ The seasonal component of the series (for example, the month of the year)
+ The seasonal lag of the series (for a monthly series, lag 12)
+ A linear trend and its second polynomial (trend squared)
We will use the dplyr and lubridate packages to create those features, as we can see in the following code:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
df <- df %>%
  mutate(month = factor(month(date, label = TRUE), ordered = FALSE),
         lag12 = lag(y, n = 12)) %>%
  filter(!is.na(lag12))
We will then add the trend component and its second polynomial (trend squared):
df$trend <- 1:nrow(df)
df$trend_sqr <- df$trend ^ 2
Let’s view the structure of the df object after adding the new features:
str(df)
## 'data.frame': 168 obs. of 6 variables:
## $ date : Date, format: "2006-01-01" "2006-02-01" ...
## $ y : num 1176 1298 1577 1490 1534 ...
## $ month : Factor w/ 12 levels "Jan","Feb","Mar",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ lag12 : num 1096 1286 1616 1542 1537 ...
## $ trend : int 1 2 3 4 5 6 7 8 9 10 ...
## $ trend_sqr: num 1 4 9 16 25 36 49 64 81 100 ...
There are additional feature engineering steps, such as scaling the inputs or one-hot encoding categorical variables, that are not required in the case of the USVSales series but may be required in other cases. Since the values of the input series range between 800 and 1,700, there is no need to scale the series and its inputs. In addition, since the h2o package supports inputs as R factors, there is no need to apply one-hot encoding.
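For reference, had scaling or one-hot encoding been needed, both can be done with base R alone. The following is a minimal sketch on a toy data.frame; the names here are illustrative and not part of the USVSales workflow:

```r
# Toy illustration: scaling a numeric input and one-hot encoding a factor
toy <- data.frame(y = c(800, 1200, 1700),
                  month = factor(c("Jan", "Feb", "Mar")))
toy$y_scaled <- as.numeric(scale(toy$y))          # center to mean 0, scale to sd 1
one_hot <- model.matrix(~ month - 1, data = toy)  # one indicator column per level
```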
In order to compare the different models that we will be testing in this chapter, we will use the same inputs that we used previously, including the same training and testing partitions, throughout this chapter.
Since our forecast horizon is for 12 months, we will leave the last 12 months of the series as testing partitions and use the rest of the series as a training partition:
h <- 12
train_df <- df[1:(nrow(df) - h), ]
test_df <- df[(nrow(df) - h + 1):nrow(df), ]
Here, the h variable represents the forecast horizon, which, in this case, is also equal to the length of the testing partition. We will evaluate the models’ performance based on the MAPE (mean absolute percentage error) score on the testing partition.
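The MAPE measures the average absolute error relative to the actual values. A minimal helper, matching the calculation we will apply later in this chapter, could look as follows (the function name is illustrative):

```r
# MAPE: mean of |actual - predicted| / actual
mape <- function(actual, predicted) {
  mean(abs(actual - predicted) / actual)
}

mape(c(100, 200), c(110, 180))  # average of a 10% and a 10% error -> 0.1
```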
Note: One of the main characteristics of ML models is the tendency to overfit on the training set. Therefore, you should expect that the ratio between the error score on the testing and training partition will be relatively larger than the corresponding results of traditional time series models, such as ARIMA, Holt-Winters, and time series linear regression.
In addition to the training and testing partitions, we need to create the inputs for the forecast itself. We will create a data.frame with the dates of the following 12 months and build the rest of the features:
forecast_df <- data.frame(date = seq.Date(from = max(df$date) + months(1),
                                          length.out = h, by = "month"),
                          trend = seq(from = max(df$trend) + 1,
                                      length.out = h, by = 1))
forecast_df$trend_sqr <- forecast_df$trend ^ 2
forecast_df$month <- factor(month(forecast_df$date, label = TRUE), ordered = FALSE)
Last but not least, we will extract the last 12 observations of the series from the df object and assign them as the future lags of the series:
forecast_df$lag12 <- tail(df$y, 12)
The performance of a forecasting model should be measured by its error rate, mainly on the testing partition but also on the training partition. You should also evaluate the performance of the model with respect to some baseline model. In the previous chapters, we benchmarked the forecast of the USgas series with the seasonal naive model.
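As a reminder of how such a baseline works, the seasonal naive model simply carries forward the value observed one seasonal cycle earlier. A self-contained sketch on simulated monthly data (not USgas) could be:

```r
# Seasonal naive on a simulated monthly series: forecast = value 12 months earlier
set.seed(1234)
y <- 100 + 10 * sin(2 * pi * (1:36) / 12) + rnorm(36, sd = 2)
train <- y[1:24]
test  <- y[25:36]
snaive_fc   <- train[13:24]                       # last observed seasonal cycle
mape_snaive <- mean(abs(test - snaive_fc) / test) # error rate of the baseline
```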
Since we are using a family of ML regression models, it makes more sense to use a regression model as a benchmark for the ML models. Therefore, we will train a time series linear regression model, as we did in the Forecasting with Linear Regression chapter. Using the training and testing partitions we created previously, let’s train the linear regression model and evaluate its performance on the testing partition:
lr <- lm(y ~ month + lag12 + trend + trend_sqr, data = train_df)
We will use the summary function to review the model details:
summary(lr)
##
## Call:
## lm(formula = y ~ month + lag12 + trend + trend_sqr, data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -472.75 -57.95 25.27 82.58 186.70
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.400e+02 1.051e+02 3.236 0.00151 **
## monthFeb 5.588e+01 5.389e+01 1.037 0.30152
## monthMar 1.644e+02 6.102e+01 2.695 0.00790 **
## monthApr 9.630e+01 5.700e+01 1.690 0.09331 .
## monthMay 1.577e+02 6.051e+01 2.606 0.01016 *
## monthJun 1.137e+02 5.835e+01 1.948 0.05343 .
## monthJul 8.452e+01 5.802e+01 1.457 0.14747
## monthAug 1.408e+02 5.990e+01 2.350 0.02016 *
## monthSep 7.564e+01 5.491e+01 1.378 0.17054
## monthOct 6.241e+01 5.398e+01 1.156 0.24957
## monthNov 5.023e+01 5.360e+01 0.937 0.35030
## monthDec 1.399e+02 5.926e+01 2.360 0.01965 *
## lag12 5.923e-01 7.443e-02 7.958 5.19e-13 ***
## trend 3.132e-01 1.280e+00 0.245 0.80713
## trend_sqr 8.091e-03 8.436e-03 0.959 0.33911
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 134.5 on 141 degrees of freedom
## Multiple R-squared: 0.728, Adjusted R-squared: 0.701
## F-statistic: 26.96 on 14 and 141 DF, p-value: < 2.2e-16
library(report)
report(lr)
## We fitted a linear model (estimated using OLS) to predict y with month
## (formula: y ~ month + lag12 + trend + trend_sqr). The model explains a
## statistically significant and substantial proportion of variance (R2 = 0.73,
## F(14, 141) = 26.96, p < .001, adj. R2 = 0.70). The model's intercept,
## corresponding to month = Jan, is at 340.02 (95% CI [132.28, 547.77], t(141) =
## 3.24, p = 0.002). Within this model:
##
## - The effect of month [Feb] is statistically non-significant and positive (beta
## = 55.88, 95% CI [-50.65, 162.42], t(141) = 1.04, p = 0.302; Std. beta = 0.23,
## 95% CI [-0.21, 0.66])
## - The effect of month [Mar] is statistically significant and positive (beta =
## 164.44, 95% CI [43.80, 285.09], t(141) = 2.69, p = 0.008; Std. beta = 0.67, 95%
## CI [0.18, 1.16])
## - The effect of month [Apr] is statistically non-significant and positive (beta
## = 96.30, 95% CI [-16.38, 208.98], t(141) = 1.69, p = 0.093; Std. beta = 0.39,
## 95% CI [-0.07, 0.85])
## - The effect of month [May] is statistically significant and positive (beta =
## 157.65, 95% CI [38.04, 277.27], t(141) = 2.61, p = 0.010; Std. beta = 0.64, 95%
## CI [0.15, 1.13])
## - The effect of month [Jun] is statistically non-significant and positive (beta
## = 113.65, 95% CI [-1.70, 229.00], t(141) = 1.95, p = 0.053; Std. beta = 0.46,
## 95% CI [-6.93e-03, 0.93])
## - The effect of month [Jul] is statistically non-significant and positive (beta
## = 84.52, 95% CI [-30.20, 199.23], t(141) = 1.46, p = 0.147; Std. beta = 0.34,
## 95% CI [-0.12, 0.81])
## - The effect of month [Aug] is statistically significant and positive (beta =
## 140.76, 95% CI [22.35, 259.17], t(141) = 2.35, p = 0.020; Std. beta = 0.57, 95%
## CI [0.09, 1.05])
## - The effect of month [Sep] is statistically non-significant and positive (beta
## = 75.64, 95% CI [-32.91, 184.19], t(141) = 1.38, p = 0.171; Std. beta = 0.31,
## 95% CI [-0.13, 0.75])
## - The effect of month [Oct] is statistically non-significant and positive (beta
## = 62.41, 95% CI [-44.31, 169.14], t(141) = 1.16, p = 0.250; Std. beta = 0.25,
## 95% CI [-0.18, 0.69])
## - The effect of month [Nov] is statistically non-significant and positive (beta
## = 50.23, 95% CI [-55.73, 156.19], t(141) = 0.94, p = 0.350; Std. beta = 0.20,
## 95% CI [-0.23, 0.64])
## - The effect of month [Dec] is statistically significant and positive (beta =
## 139.86, 95% CI [22.70, 257.01], t(141) = 2.36, p = 0.020; Std. beta = 0.57, 95%
## CI [0.09, 1.05])
## - The effect of lag12 is statistically significant and positive (beta = 0.59,
## 95% CI [0.45, 0.74], t(141) = 7.96, p < .001; Std. beta = 0.60, 95% CI [0.45,
## 0.75])
## - The effect of trend is statistically non-significant and positive (beta =
## 0.31, 95% CI [-2.22, 2.84], t(141) = 0.24, p = 0.807; Std. beta = 0.06, 95% CI
## [-0.41, 0.52])
## - The effect of trend sqr is statistically non-significant and positive (beta =
## 8.09e-03, 95% CI [-8.59e-03, 0.02], t(141) = 0.96, p = 0.339; Std. beta = 0.24,
## 95% CI [-0.26, 0.74])
##
## Standardized parameters were obtained by fitting the model on a standardized
## version of the dataset. 95% Confidence Intervals (CIs) and p-values were
## computed using a Wald t-distribution approximation.
Next, we will predict the corresponding values of the series on the testing partition with the predict function by using test_df as input:
test_df$yhat <- predict(lr, newdata = test_df)
Now, we can evaluate the model’s performance on the testing partition:
mape_lr <- mean(abs(test_df$y - test_df$yhat) / test_df$y)
mape_lr
## [1] 0.08551179
Hence, the MAPE score of the linear regression forecasting model on the testing partition is 8.55%. We will use this score to benchmark the performance of the ML models.
The h2o package is based on distributed and parallel computing, which speeds up computation and allows it to scale up for big data. The processing is done either in-memory (using the machine’s internal RAM) or on distributed clusters (for example, on AWS or Google Cloud). Therefore, we will load the package and then launch an in-memory cluster with the h2o.init function:
library(h2o)
##
## ----------------------------------------------------------------------
##
## Your next step is to start H2O:
## > h2o.init()
##
## For H2O package documentation, ask for help:
## > ??h2o
##
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit https://docs.h2o.ai
##
## ----------------------------------------------------------------------
##
## Attaching package: 'h2o'
## The following objects are masked from 'package:lubridate':
##
## day, hour, month, week, year
## The following objects are masked from 'package:stats':
##
## cor, sd, var
## The following objects are masked from 'package:base':
##
## &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
## colnames<-, ifelse, is.character, is.factor, is.numeric, log,
## log10, log1p, log2, round, signif, trunc
h2o.init(max_mem_size = "16G")
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 2 hours 1 minutes
## H2O cluster timezone: America/Los_Angeles
## H2O data parsing timezone: UTC
## H2O cluster version: 3.38.0.1
## H2O cluster version age: 5 months and 19 days !!!
## H2O cluster name: H2O_started_from_R_howardnguyen_arr870
## H2O cluster total nodes: 1
## H2O cluster total memory: 3.15 GB
## H2O cluster total cores: 10
## H2O cluster allowed cores: 10
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## R Version: R version 4.2.2 (2022-10-31)
## Warning in h2o.clusterInfo():
## Your H2O cluster version is too old (5 months and 19 days)!
## Please download and install the latest version from http://h2o.ai/download/
The h2o.init function allows you to set the memory size of the cluster with the max_mem_size argument. The output of the function, as shown in the preceding output, provides information about the cluster’s setup.
Any data that is used throughout the training and testing process of the models by the h2o package must be loaded into the cluster itself. The as.h2o function transforms a data.frame object into an h2o frame on the cluster:
train_h <- as.h2o(train_df)
test_h <- as.h2o(test_df)
In addition, we will transform the forecast_df object (the future values of the series inputs) into an h2o frame, which will be used later in this chapter to generate the final forecast:
forecast_h <- as.h2o(forecast_df)
For convenience, we will assign the names of the independent and dependent variables to the x and y variables:
x <- c("month", "lag12", "trend", "trend_sqr")
y <- "y"
Now that the data has been loaded into the working cluster, we can start the training process.
The h2o package provides a set of tools for training and testing ML models. The most common training approaches are a simple split into training and validation partitions, and cross-validation. Throughout this chapter, we will use the cross-validation (CV) approach to train these models.
Now that we have prepared the data, created new features, and launched an h2o cluster, we are ready to build our first forecasting model with the Random Forest (RF) algorithm. The RF algorithm is one of the most popular ML models, and it can be used for both classification and regression problems. In a nutshell, the RF algorithm is based on an ensemble of multiple tree models.
As its name implies, the algorithm has two main components: randomness, which comes from training each tree on a bootstrap sample of the observations and a random subset of the features; and a forest, which is the ensemble of decision trees itself. After the forest is built, the algorithm combines the predictions of all the trees in the forest into one output. This combination of randomizing the input of each tree model and then averaging their results reduces the likelihood of overfitting the model.
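The averaging step can be illustrated with a tiny simulation of the bagging idea (no actual trees involved): each "tree" below is replaced by the mean of a bootstrap sample, and averaging many of them yields a stable combined prediction:

```r
# Toy sketch of bagging: average predictions from many bootstrap resamples
set.seed(1234)
x <- rnorm(200, mean = 50, sd = 10)
tree_preds <- replicate(500, mean(sample(x, replace = TRUE)))  # one "tree" per resample
ensemble_pred <- mean(tree_preds)  # the ensemble output is close to mean(x),
                                   # with lower variance than a single resample
```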
RF has several tuning parameters that allow you to control the level of randomization of the sampling process and how deep the forest is. The h2o.randomForest function from the h2o package provides the framework for training and tuning the RF model. The following are its main tuning arguments:

+ ntrees: the number of trees in the forest
+ mtries: the number of features randomly sampled as split candidates for each tree
+ sample_rate: the share of observations sampled for each tree
+ max_depth: the maximum depth of each tree
In addition, this function has several control arguments that allow you to limit the running time of the model and to set stopping criteria when adding more trees doesn’t improve the model’s performance. These arguments are as follows:

+ stopping_rounds: the number of scoring rounds without improvement before training stops
+ stopping_metric: the metric used to evaluate improvement (for example, RMSE)
+ stopping_tolerance: the minimal improvement required to keep training
We will start with a simple RF model using 500 trees and 5-fold CV training. In addition, we will set stopping criteria to halt the training process when there is no significant improvement in the model’s performance. In this case, we will set the stopping metric to RMSE, the stopping tolerance to 0.0001, and the stopping rounds to 10:
rf_md <- h2o.randomForest(training_frame = train_h,
nfolds = 5,
x = x,
y = y,
ntrees = 500,
stopping_rounds = 10,
stopping_metric = "RMSE",
score_each_iteration = TRUE,
stopping_tolerance = 0.0001,
seed = 1234)
The h2o.randomForest function returns an object with information about the model’s parameter settings and its performance on the training set (and validation set, if used). We can view the contribution of the model inputs with the h2o.varimp_plot function, which plots the input variables ranked by their contribution to the model’s performance on a scale between 0 and 1:
h2o.varimp_plot(rf_md)
As we can see from the preceding variable importance plot, the lag variable, lag12, is the most important to the model. This shouldn’t be a surprise, as we saw the strong relationship between the series and its seasonal lag in the correlation analysis. It is followed by trend_sqr, month, and trend.
The output of the model contains (besides the model itself) information about the model’s performance and parameters. Let’s review the model summary:
rf_md@model$model_summary